investigative data and final disposition of service. Several noteworthy relationships were 
found between derogatory information developed in the investigation and the subse- 

I. INTRODUCTION 



A. BACKGROUND 

The importance of protecting sensitive military information and operations from 
potentially hostile sources is a concept as old as warfare itself. Events of the recent past 
indicate that the nation must never grow complacent about its ability to safeguard clas- 
sified information. World-wide defense commitments, the ideological and historic dif- 
ferences existing between the US and other nations, and the huge number of people who 
frequently access, create, analyze and service the vast amount of sensitive information 
combine to create a tremendous managerial problem: Who can be trusted with access 
to the nation's security secrets? 

The need to investigate the backgrounds of those people needing access to classified 
information has been a fixture of the national security establishment for many years. 
Typically, an individual, by virtue of his duty responsibilities, is determined to need 
regular access to sensitive information of some level (secret, top secret, sensitive 
compartmented information, etc.). A fairly standard administrative procedure is employed 
throughout the Department of Defense (DOD) to determine whether the person 
should be allowed access to classified information. 

B. THE SECURITY INVESTIGATION PROCEDURE 

The first element in a security investigation is the completion of a detailed form 
named the Statement of Personal History (SPH). The SPH requires specific information 
about a person's past. Information such as a list of close family members, foreign travel, 
arrests and convictions, schools attended, jobs held, creditors, and personal references 
are all required. The SPH is the starting point for any security investigation. 

The next step in the investigation consists of the National Agency Check (NAC) and 
the Local Agency Check (LAC). Law enforcement agencies, both local (e.g., city or state 
police) and national (e.g., the FBI), are queried about outstanding warrants and records 
of arrests. A check of credit information is also conducted with national and local credit 
bureaus to determine whether an individual has money problems. 

The clearance will normally be granted to a person who requires access to information 
with a classification of Secret or lower when the above procedure does not turn up 
any inconsistencies. 

A person requiring access to top secret or higher level information will undergo a 
much more detailed investigation: a background investigation (BI) or a special 
background investigation (SBI). These investigations are much more thorough than 
those for lesser clearances and involve actual interviews with people who know and have 
developed a relationship with the individual being investigated. Neighbors, friends, 
school officials, former employers and others may be interviewed. If the answers are 
consistent and positive, the subsequent investigation will be much less detailed than if a 
negative trend develops and other sources of information are "developed" by the 
investigators. If information is developed which contradicts that listed on the Statement of 
Personal History or is conspicuously absent from it, the subject will almost certainly 
be interviewed. In certain other types of investigations, an interview is always required. 

The result of this investigation is a dossier containing basic biographical data, 
derogatory information obtained from the SPH and other sources (or lack of such 
information), and recommendations as to the trustworthiness of the subject of the 
investigation. Derogatory information varies from traffic infractions to emotional 
problems to felonies. All the investigative data is gathered for the clearance 
determination. An adjudicator reads the investigation file and makes the judgement as to the 
award of the clearance. 

The last step in the security investigation process is a review of the information ob- 
tained and determination of whether the clearance should be granted. 

Review of the information is performed in accordance with the adjudication guidelines 
contained in the DOD Personnel Security Regulation, DOD 5200.2-R, dated January 
1987. The regulation lists the factors which can disqualify an individual for a clearance, 
as well as the mitigating factors which might allow a clearance to be granted even though 
disqualifying factors are present in the information. For example, a person might admit 
to experimental use of marijuana (fewer than six instances of use) in adolescence. 
This use of cannabis (marijuana or its derivatives) is considered a disqualifying factor. 
A mitigating factor in this instance is that the experimental abuse occurred more than 
six months ago, and the individual has no intention of using cannabis or other drugs in 
the future [Ref. 1]. 

The final determination of clearance for an individual whose record contains 
disqualifying information is a subjective one. It is based upon the merits of the case and 
on the adjudicator's evaluation of the mitigating factors, which ideally indicate 
the actual reliability of the individual in the future. 

C. BACKGROUND OF THE SPECIAL BACKGROUND INVESTIGATION DATA 
BASE (SBID) 

It is apparent that the investigation procedure must generate a tremendous amount 
of data about every person who is investigated for a security clearance. It is clear that 
we do not wish to trust national security information to those who are untrustworthy 
enough to violate laws, regulations, and accepted standards of conduct. Could this data 
be used to examine whether data obtained from the security investigations were in any 
way related to the future service record of those investigated? Could this data provide 
insight into the investigation process, allowing investigative resources to be more effi- 
ciently allocated? 

The Defense Personnel Security Research and Education Center (PERSEREC) in 
Monterey, California was directed to examine a large sample of data produced from 
security investigations of first-term enlistees entering the Navy during the years 
1979-1982. The purpose of the study was to develop insight about the information developed 
in security investigations, especially when the final disposition of service of investigative 
subjects was known. 

The individuals whose records were involved in the study: 

1. Had background investigations initiated within three months of enlistment; 

2. Were separated or discharged during, or upon completion of their initial tour of 
duty; 

3. Were discharged for homosexuality, misconduct, drug abuse, court martial, char- 
acter and behavior disorder, or normal completion of enlistment. 

Thus, in the data base, there are five types of unsuitability discharge categories and 
one control group of personnel who successfully completed their term of service. 

Seven hundred records were selected randomly (based upon the last digit of the 
social security number) for the study. One hundred cases were selected from each of the 
five unsuitability discharge groups, and two hundred cases were selected in which the 
individuals were normally separated. Only 564 cases were eventually included in the study, 
because those cases where the Background Investigation was cancelled for 
any reason were removed. 

The number of records chosen in each category was not proportional to that 
character-of-service category's share of the actual population. An immense number of 
records would need to be drawn as a single sample in order to get a large enough 
representation from each adverse discharge category. As an illustration, consider that there 
are 73 records in this data base from the court martial character-of-service category. 
Persons who are investigated receive this adverse character-of-service designation 
approximately 0.18% of the time. Simple arithmetic (73 divided by 0.0018) indicates that 
getting approximately 73 records in this category from a single sample of the investigation 
population at large would require a sample size of nearly 41,000, which is clearly not 
reasonable. Table 1 displays the approximate percentages of those initially investigated 
who receive each of the six character-of-service designations discussed in this thesis 
[Ref. 2]. There are other designations which are not considered here. 



Table 1. CHARACTER OF SERVICE CATEGORY PROPORTIONS 

Character of Service Category    % of Investigation Population Receiving Category 
Good                             90.4% 
Homosexual                       0.92% 
Misconduct                       1.2% 
Drug/Alcohol Abuse               1.8% 
Court Martial                    0.18% 
Character Behavior Disorder      0.65% 
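
The sample-size arithmetic discussed above can be checked with a short calculation. The following is only an illustrative sketch: it takes the approximate proportions from Table 1 and computes, for each category, how large a single simple random sample would have to be for the expected count in that category to reach 73 (the court martial count quoted in the text). The function and variable names are hypothetical and are not part of the original study.

```python
import math

# Approximate share of the investigated population receiving each
# character-of-service designation (Table 1), written as fractions.
PROPORTIONS = {
    "Good": 0.904,
    "Homosexual": 0.0092,
    "Misconduct": 0.012,
    "Drug/Alcohol Abuse": 0.018,
    "Court Martial": 0.0018,
    "Character Behavior Disorder": 0.0065,
}


def required_sample_size(target_count, proportion):
    """Smallest sample whose expected count in the category reaches target_count."""
    return math.ceil(target_count / proportion)


if __name__ == "__main__":
    for category, p in PROPORTIONS.items():
        print(f"{category:28s} {required_sample_size(73, p):>7,d}")
```

For the court martial category this gives 40,556, which agrees with the figure of nearly 41,000 quoted in the text.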



The data base was created by taking the investigation information from microfiche 
and entering it into a Lotus 123 spreadsheet. There were 93 possible entries for each of 
the 564 records resulting in a total data base with the potential for approximately 52,500 
data points. 

The data was essentially categorical in nature, with an individual record containing 
personal information ranging from date of birth and military specialty to findings from 
high school to type of discharge. A four-digit code representing the type of derogatory 
information was the prime means of listing this data and allowed standardization across 
the data base. Other codes were created to represent other pieces of information such 
as the recommendations obtained at the various sources (high schools, colleges, 
neighborhoods, etc.), race or marital status. 

Problems with the size of the data base, the slow response of an AT-style 
micro-computer when dealing with such a large data set, and the limitations of Lotus 123 in 
performing statistical functions allowed only a cursory analysis of the data base as 
originally implemented. Clearly another approach was necessary to analyze and obtain 
insights from this data. 

D. PURPOSE 

The purpose of this thesis is two-fold: to investigate some available methods for 
organizing and analyzing a large, categorical data base; and to use statistical and data- 
analytical techniques to evaluate the personal security data detailed above in order to 
develop insights and correlations between the security investigation data and the subse- 
quent disposition of the subject's term of enlistment. 

E. LIMITATIONS 

The data used in this thesis was analyzed as provided. It was not possible to verify 
that the data was actually selected at random; however, we assume that each sample was 
selected randomly. The number of records in each group was set arbitrarily (one hundred 
records from each of the unsuitability discharge categories and two hundred records with 
normal completion of service), so it may be difficult to apply the results of this 
investigation to the general population. 

F. ANALYTICAL TOOLS USED 

The data was initially reduced and documented using the Statgraphics (version 2.6) 
statistical software package on a Compaq 286 portable personal computer with two 
megabytes of additional random access memory (RAM). After reduction it was 
transferred to an IBM 3033 System 370 mainframe computer using the MVS batch system. 
On the mainframe computer, Grafstat, an unreleased IBM mainframe data analysis and 
statistical package, was used. In addition, APL programs for categorical analysis were 
written using APL Graphpak to supplement the routines available in Grafstat. 

G. ORGANIZATION OF THESIS 

Following this introduction, the data reduction techniques used for this thesis and 
the lessons learned from that effort are discussed in Chapter II. The main body of the 
thesis is contained in Chapter III and deals with the data operations and the analysis 
conducted. Chapter IV discusses some promising areas for further analysis which were 
only briefly pursued because of time constraints. The closing chapter summarizes the 
results of this research, sets forth the conclusions drawn from those results and provides 
recommendations for future research involving this data. 

II. DATA REDUCTION 



A. GENERAL 

PERSEREC experienced problems in attempting to analyze a data base of this 
magnitude. This led them to investigate other methods of configuring the data in order 
to perform the analysis they felt was necessary. Subsequently, the Lotus 123 files were 
exported to the mainframe computer and configured into Conversational Monitoring 
System (CMS) ASCII files. The categorical nature of the data and its overwhelming size 
dictated that the data base be documented and verified before any 
further useful analysis could be performed. However, the data editors available in CMS 
on the mainframe computer did not offer the ability to easily operate on column fields 
and did not have the flexibility needed to simultaneously document the work performed 
as it proceeded. 

B. DATA EDITING 

Statgraphics (version 2.6) offered a user-friendly data editor with the requisite 
capabilities. Unfortunately, it was available only on a personal computer. A Compaq 
286 portable AT-compatible micro-computer with two megabytes of additional memory 
(usable as a virtual disk) was used. It proved extremely useful; however, its size limited 
the amount of data which could be operated upon without exceeding the memory 
limitations of the computer (these memory restrictions will be alleviated in the future when 
using the new 80386-based machines). 

The CMS files were transferred into micro-computer ASCII files and then stored 
on floppy disks and subsequently read into six Statgraphics (ASF) files. Each of the files 
consisted of approximately 15 of the variable entries for each of the 564 records (ap- 
proximately 8400 data points). At any one time six or seven of these variables could be 
operated upon within the data editor. 

A general procedure was followed in formatting and verifying each of the six files. 
First, the file was checked to ensure that the data, as it existed on the CMS files, had 
been transferred correctly. In one instance half of the field of one variable was truncated 
and had to be reconstructed. 

Next, the numeric coding used for each column was researched and ambiguities 
resolved by recoding or removal. This step required considerable research into the coding 
methods and the investigation process in order to understand and, if necessary, change 
the numeric codes for the sake of clarity. 

Finally, a frequency tabulation of each column was performed and labels were 
created which corresponded to the coded values. These labels were especially useful later 
in the analysis when cross-tabulations between variable vectors were conducted. 
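
The frequency tabulation step can be illustrated with a small sketch. This is not the Statgraphics procedure used in the thesis; it is a minimal Python example, assuming a coded column held as a list (with `None` for blanks) and a hypothetical dictionary of labels. The codes and labels shown are invented for illustration only.

```python
from collections import Counter


def frequency_table(column, labels):
    """Tabulate a coded column as (code, label, count) rows.

    Blank cells (None) are listed last; codes with no label are flagged
    so that they can be researched, as described in the text.
    """
    counts = Counter(column)
    rows = []
    ordered = sorted(counts.items(), key=lambda item: (item[0] is None, item[0] or 0))
    for code, n in ordered:
        if code is None:
            label = "(blank)"
        else:
            label = labels.get(code, "(undocumented code)")
        rows.append((code, label, n))
    return rows


if __name__ == "__main__":
    # Hypothetical codes and labels, not taken from the PERSEREC data base.
    example_column = [1, 2, 1, 1, None, 7, 2]
    example_labels = {1: "SINGLE", 2: "MARRIED"}
    for row in frequency_table(example_column, example_labels):
        print(row)
```

Flagging undocumented codes in the tabulation is a design choice made here because the text stresses that ambiguous codes had to be researched before analysis.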

The procedure discussed above was iterative, as sometimes several interpretations 
were considered before one was confirmed as correct. Documentation of the data base was 
conducted throughout these three steps. The variables contained in the data 
base, their purpose and their types are listed in Figure 1 through Figure 3. These 
figures are a direct copy of the file management screen that appears in Statgraphics as 
you enter the full-screen editor or view the data directory. Comments are limited to 21 
characters for each variable. 



VARIABLE  WIDTH  TYPE  RANK  LENGTH  DATE     TIME   COMMENT 
A         5      I     1     564     3/18/88  11:59  RECORD NO. (RANDOM) 
C         3      I     1     564     2/26/88  11:08  SEX (MALE OR FEMALE) 
D         8      D     1     564     3/18/88  13:02  BIRTHDATE 
F         8      D     1     564     3/18/88  14:01  DATE OF ENTNAC 
G         8      D     1     564     3/18/88  14:01  BI REQUEST DATE 
I         3      I     1     564     2/26/88  11:08  REASON FOR BI 
J         4      I     1     564     2/26/88  11:08  OCCUPATION CODE 
K         3      I     1     564     2/26/88  11:10  REASON FOR INTERVIEW 
L1        6      I     1     564     2/26/88  12:29  INTERVIEW INFO - 1. 
L2        6      I     1     564     2/26/88  12:29  INTERVIEW INFO - 2. 
L3        6      I     1     564     2/26/88  12:29  INTERVIEW INFO - 3. 
L4        6      I     1     564     2/26/88  12:29  INTERVIEW INFO - 4. 
M1        6      I     1     564     2/26/88  12:29  FBI/DCII FINDINGS1 
M2        6      I     1     564     2/26/88  12:29  FBI/DCII FINDINGS2 
N1        6      I     1     564     2/26/88  14:03  LOCAL AGENCY CHECK 
N2        6      I     1     564     2/26/88  14:31  LOCAL AGENCY CHECK 
N3        6      I     1     564     2/26/88  14:03  LOCAL AGENCY CHECK 
N4        6      I     1     564     2/26/88  14:03  LOCAL AGENCY CHECK 
O1        6      I     1     564     2/26/88  14:03  CREDIT BUREAU CHECK 
O2        6      I     1     564     2/26/88  14:03  CREDIT BUREAU CHECK 
P         4      I     1     564     2/26/88  10:59  H S - # OF SOURCES 

Figure 1. List of Variables Contained in the Data Base: Extracted from the 
Statgraphics Data Management Screen. 






VARIABLE  WIDTH  TYPE  RANK  LENGTH  DATE     TIME   COMMENT 
Q1        3      I     1     564     3/ 4/88  14:32  HIGH SCHOOL RECOMM. 
Q2        6      I     1     564     3/ 4/88  14:32  HIGH SCHOOL RECOMM. 
Q3        6      I     1     564     3/ 4/88  14:32  HIGH SCHOOL RECOMM. 
R1        3      I     1     564     2/26/88  15:47  HIGH SCHOOL FINDINGS 
R2        5      I     1     564     2/26/88  16:30  HIGH SCHOOL FINDINGS 
R3        5      I     1     564     2/26/88  16:30  HIGH SCHOOL FINDINGS 
R4        5      I     1     564     2/26/88  16:30  HIGH SCHOOL FINDINGS 
S         3      I     1     564     2/26/88  10:59  COLL. - # OF SOURCES 
T         4      I     1     564     2/26/88  11:00  COLL. RECOMMENDATION 
U         5      I     1     564     2/26/88  11:00  COLLEGE FINDINGS 
V         3      I     1     564     2/26/88  11:00  EMPL. # OF SOURCES 
W         3      I     1     564     2/26/88  10:53  CO-WORKER # SOURCES 
X1        3      I     1     564     3/ 4/88  11:28  EMPLOYMENT RECOMM. 
X2        6      I     1     564     3/ 4/88  11:15  EMPLOYMENT RECOMM. 
X3        6      I     1     564     3/ 4/88  11:15  EMPLOYMENT RECOMM. 
Y1        5      I     1     564     3/ 4/88  13:36  EMPLOYMENT FINDINGS 
Y2        5      I     1     564     3/ 4/88  13:36  EMPLOYMENT FINDINGS 
Y3        5      I     1     564     3/ 4/88  13:36  EMPLOYMENT FINDINGS 
Y4        5      I     1     564     3/ 4/88  13:36  EMPLOYMENT FINDINGS 
Z         2      I     1     564     2/26/88  10:54  NEIGH. # OF SOURCES 
AA1       3      I     1     564     3/ 4/88  12:02  SPH NEIGH. RECOMM. 
AA2       3      I     1     564     3/ 4/88  12:02  DEV. NEIGH. REC. 
AA3       6      I     1     564     3/ 4/88  12:02  DEV. NEIGH. REC. 
AB1       5      I     1     564     3/ 4/88  14:06  NEIGH. FINDINGS 
AB2       5      I     1     564     3/ 4/88  14:06  NEIGH. FINDINGS 
AB3       5      I     1     564     3/ 4/88  14:06  NEIGH. FINDINGS 
AC        3      I     1     564     2/26/88  10:56  # OF OTHER SOURCES 
AD1       3      I     1     564     3/ 4/88  15:17  OTHER RECOMM. 
AD2       6      I     1     564     3/ 4/88  15:17  OTHER RECOMM. 
AE1       5      I     1     564     3/11/88  08:29  OTHER FINDINGS 
AE2       5      I     1     564     3/11/88  08:29  OTHER FINDINGS 
AE3       5      I     1     564     3/11/88  08:29  OTHER FINDINGS 
AE4       5      I     1     564     3/11/88  08:29  OTHER FINDINGS 
AF        2      I     1     564     2/26/88  10:46  RACE 



Figure 2. List of Variables Contained in the Data Base (Continued): Extracted 
from the Statgraphics Data Management Screen. 

VARIABLE  WIDTH  TYPE  RANK  LENGTH  DATE     TIME   COMMENT 
AG        2      I     1     564     2/26/88  10:48  MARITAL STATUS 
AJ        2      I     1     564     2/26/88  10:48  DEPENDENTS 
AN        6      I     1     564     2/26/88  10:48  # OF SIBLINGS 
AO        3      I     1     564     2/26/88  10:48  PERMANENT RESIDENCE 
AQ        7      I     1     564     3/18/88  14:17  ENLISTMENT DATE 
AR        5      I     1     564     2/26/88  10:48  AGE AT ENLISTMENT 
AS        4      I     1     564     2/26/88  10:49  MONTHS HS TO ENLIST 
AT        4      I     1     564     2/26/88  10:49  # JOBS HS TO ENLIST 
AU        3      I     1     564     2/26/88  10:49  # MONTHS UNEMPL. 
AV        3      I     1     564     2/26/88  10:49  # MONTHS COLLEGE 
AW        3      I     1     564     2/26/88  10:49  MO. UNEMPL. PRIOR ENL 
AX1       5      I     1     564     3/11/88  12:54  UNFAV. INFO. ON SPH 
AX2       5      I     1     564     3/11/88  12:54  UNFAV. INFO. ON SPH 
AX3       5      I     1     564     3/11/88  12:54  UNFAV. INFO. ON SPH 
AX4       5      I     1     564     3/11/88  12:54  UNFAV. INFO. ON SPH 
AY1       5      I     1     564     3/11/88  11:36  SUMMARY BI 
AY2       5      I     1     564     3/11/88  11:36  SUMMARY BI 
AY3       5      I     1     564     3/11/88  11:36  SUMMARY BI 
AY4       5      I     1     564     3/11/88  11:36  SUMMARY BI 
BB               C     2     564 8   3/18/88  14:42 
BC        3      I     1     564     2/26/88  10:33  CLEARANCE TYPE 
BD               C     2     564 8   3/18/88  14:43  CLEARANCE REV. : DATE 
BE               C     2     564 8   3/18/88  14:44  DATE OF SEPERATION 
BF        3      I     1     564     2/26/88  10:36  RELEASE CODE 
BG1       5      I     1     564     3/11/88  14:06  MILITARY OFFENSES 
BG2       5      I     1     564     3/11/88  14:06  MILITARY OFFENSES 
BG3       5      I     1     564     3/11/88  14:06  MILITARY OFFENSES 
BG4       5      I     1     564     3/11/88  14:06  MILITARY OFFENSES 
BH1       5      I     1     564     3/11/88  14:29  REMARKS/DISCHARGE 
BH2       5      I     1     564     3/11/88  14:30  REMARKS/DISCHARGE 
BH3       5      I     1     564     3/11/88  14:30  REMARKS/DISCHARGE 
BH4       5      I     1     564     3/11/88  14:30  REMARKS/DISCHARGE 
BL        4      I     1     564     2/26/88  10:08  STATUS OF 5520/20 
BM        5      I     1     564     2/26/88  10:08  DISCHARGE CASE CODE 
BO        3      I     1     564     2/26/88  10:08  INTERSVC. SEP. CODE 
BP        2      I     1     564     2/26/88  10:08  CHARACTER OF SERVICE 
BQ        2      I     1     564     2/26/88  10:08  TYPE OF DISCHARGE 

Figure 3. List of Variables Contained in the Data Base (Continued): Extracted 
from the Statgraphics Data Management Screen. 



C. DATA REPRESENTATION PROBLEMS 

Coding inconsistencies are an inherent problem in verifying and documenting a large 
data base obtained from an outside source. Ideally, thorough documentation of the 
codes used and the thought process employed in creating the data base is included with 
it. However, this is seldom the case. 

The PERSEREC data base had many inconsistencies along with several strengths. 
A major strength of the data organization was the standardization of most of the coding 
employed. Derogatory information codes (used in 43 of the 93 columns) and 
recommendation codes (used in 13 of the columns) were used in a fairly standard manner. The 
numeric code for all derogatory information contained in the data base consisted of a 
standard four-digit code representing 135 different infractions. The infractions 
and their codes are listed in Appendix B. 

The numeric code used for the types of recommendations obtained from various 
sources consisted of a two-digit integer representing the total number of persons who: 

1. Recommended the subject for a position of trust; 

2. Recommended the subject for a position of trust, with supervision; 

3. Did not recommend the subject for a position of trust; 

4. Declined comment. 

Most sources of derogatory information are represented by several columns in the 
data base. A source is considered a location such as college, high school, employer, 
neighborhood, etc. Multiple columns are available for each source category to allow 
room for several different types of derogatory information to be displayed, if necessary. 
Table 2 shows how the information of columns Y1, Y2, Y3, and Y4 (findings or 
derogatory information obtained from employers) was represented: 



Table 2. INITIAL REPRESENTATION OF DEROGATORY INFORMATION 
(EXAMPLE). 

Record Number  Y1    Y2    Y3    Y4 
1              9999  9999 
2              9999  1071  1106 
3              1829  9999  1844  9999 
4              1805  1824 







After research, these records were interpreted in the following manner: If there are 
only 9999 entries in a particular record's entries in Y1 - Y4, then no derogatory 
information from the subject's former employers was found. It is possible that no 
interview was conducted, although all information indicates that former 
employers were visited in almost all instances. If any 9999 entries are contained along 
with derogatory information for a particular record, those 9999 codes are meaningless. 
The corrected records are shown in Table 3. 



Table 3. REPRESENTATION OF DEROGATORY INFORMATION AFTER 
REDUCTION (EXAMPLE). 

Record Number  Y1    Y2    Y3    Y4 
1              9999 
2              1071  1106 
3              1829  1844 
4              1805  1824 







In this table no derogatory information was obtained on the person represented by 
record number 1. For the second person, the investigator found evidence that the person was 
known to lie (1071), and that he was at some time intoxicated in public (1106). The third 
person had evidence of vandalism (1829) and malicious mischief (1844). The fourth 
person was found to have an incident of reckless driving (1805) and also illegal use of a 
firearm (1824). 

Columns representing derogatory information obtained from colleges, high schools, 
neighbors, and other sources were similarly reduced. 
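
The reduction rule described above can be sketched in a few lines. This is only an illustration of the stated rule, not the procedure actually carried out in Statgraphics and APL: it assumes each record's findings columns are held as a list with `None` for blank cells, and the function name is hypothetical.

```python
NO_DEROG = 9999  # code meaning "no derogatory information" in columns Y1 - Y4


def reduce_findings(row):
    """Apply the reduction rule to one record's findings columns.

    Real derogatory codes are kept in their original order and packed to
    the left; 9999 entries that accompany them are dropped as meaningless.
    A record containing only 9999 entries keeps a single 9999.
    """
    codes = [c for c in row if c is not None]
    findings = [c for c in codes if c != NO_DEROG]
    if findings:
        kept = findings
    elif codes:
        kept = [NO_DEROG]
    else:
        kept = []
    return kept + [None] * (len(row) - len(kept))


if __name__ == "__main__":
    table_2 = [
        [9999, 9999, None, None],
        [9999, 1071, 1106, None],
        [1829, 9999, 1844, 9999],
        [1805, 1824, None, None],
    ]
    for record in table_2:
        print(reduce_findings(record))
```

Applied to the four example records of Table 2, the function reproduces the rows of Table 3.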

As discussed above, the 9999 code used in columns Y1 - Y4 represented "no 
derogatory information." Research revealed that this interpretation of the 9999 code could 
not be used in some of the other columns. In the security investigation realm, employers 
and neighbors are considered "productive" sources. With that designation, the former 
employers and neighbors of a subject are almost always interviewed; thus the 9999 code 
for those sources means "no derogatory information." Sources other than employers and 
neighbors, on the other hand, are normally only visited by an investigator when he is 
fairly certain to obtain derogatory information. The 9999 code in conjunction with these 
types of sources means "no interview conducted." 
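
The source-dependent meaning of the 9999 code can be made explicit with a small lookup. The sketch below is illustrative only; the source names used as keys are hypothetical labels chosen for this example, not identifiers from the data base.

```python
# Sources that are almost always interviewed, per the discussion above.
PRODUCTIVE_SOURCES = {"employer", "neighborhood"}


def meaning_of_9999(source):
    """Return the interpretation of a 9999 findings code for a given source type."""
    if source in PRODUCTIVE_SOURCES:
        return "no derogatory information"
    return "no interview conducted"


if __name__ == "__main__":
    for source in ("employer", "neighborhood", "high school", "college", "other"):
        print(f"{source:12s} 9999 -> {meaning_of_9999(source)}")
```

Keeping the interpretation in one function means the two meanings cannot be confused later in the analysis, which is the risk the text describes.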

An even more confusing coding scheme was discovered relating to the 
recommendations obtained from the five types of sources outlined above. For the employer, high 
school, college and other sources, a 99 code represents "no interview." The coding for 
neighborhood recommendations was different. 

Neighborhoods are the source of many developed sources of derogatory 
information. A distinction was made between the recommendations of neighbors listed on the 
SPH (generally positive) and those from neighborhood sources developed by the 
investigators. This resulted in four possible entries for recommendations from a subject's 
neighborhood. The column vectors representing information obtained from the subject's 
neighbors are designated AA1-AA4. Column AA1 represents the recommendations 
obtained from persons listed on a subject's Statement of Personal History. Entries in 
columns AA2-AA4 were recommendations obtained from neighborhood sources developed 
by the investigator. A 99 entry in column AA1 meant "no interview conducted," while 
a 99 entry in column AA2 meant "no sources developed." Furthermore, a 99 entry in 
columns AA3 or AA4 meant nothing. These variable fields were repaired by removing 
all 99 codes from columns AA3 and AA4. 
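
The repair of the neighborhood recommendation columns can be sketched as follows. This is a hedged illustration, assuming each record is held as a dictionary keyed by column name with `None` for blanks; the function name is hypothetical and the original repair was done in the Statgraphics editor.

```python
NO_ENTRY = 99

# Meaning of a 99 entry in the columns where it carries information.
MEANING_OF_99 = {
    "AA1": "no interview conducted",
    "AA2": "no sources developed",
}


def repair_neighborhood_recs(record):
    """Return a copy of the record with meaningless 99 codes blanked in AA3 and AA4."""
    repaired = dict(record)
    for column in ("AA3", "AA4"):
        if repaired.get(column) == NO_ENTRY:
            repaired[column] = None
    return repaired


if __name__ == "__main__":
    example = {"AA1": 99, "AA2": 99, "AA3": 99, "AA4": 3}  # hypothetical record
    print(repair_neighborhood_recs(example))
```

The 99 entries in AA1 and AA2 are left in place because they carry the distinct meanings listed in `MEANING_OF_99`.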

Another instance of miscoding occurred in column AN, which represents the 
number of siblings of the subject. Throughout the field a character code of "Li" existed along 
with the usual integers (1, 2, ...) representing the number of siblings. This code was 
thoroughly researched until the only possible explanation was obtained: it represented 
"unknown." 

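A minimal sketch of how such a mixed column might be parsed is given below. It assumes the raw AN entries are available as strings and that the character code appears exactly as "Li", as printed above; the function name and the choice to keep "unknown" distinct from a blank are assumptions made for illustration.

```python
UNKNOWN = "unknown"   # stands for the "Li" character code
UNKNOWN_CODE = "Li"   # spelling as it appears in the text; assumed exact


def parse_sibling_count(raw):
    """Convert a raw AN entry to an int, UNKNOWN, or None for a blank cell."""
    text = raw.strip()
    if text == "":
        return None
    if text == UNKNOWN_CODE:
        return UNKNOWN
    return int(text)


if __name__ == "__main__":
    for entry in ("2", " 0", "Li", ""):
        print(repr(entry), "->", parse_sibling_count(entry))
```

Returning different values for "unknown" and for a blank follows the thesis's own advice to differentiate even small differences in meaning.
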
The problems highlighted here point to the importance of differentiating, by coding, 
even small differences in meaning when implementing codes. Failure to do so risks 
losing important distinctions, which may in fact invalidate the data. Another point to 
be made is that documentation is essential when data bases are created. Luckily, the 
person who performed the data entry was available for reference throughout the data 
reduction stage of this project; otherwise much of the information contained in the data 
base might have been lost. 

Erroneous entries were not commonly found in the data base. Only two erroneous 
codes (not of the 135 actual derogatory information codes) were found and they were in 
the same column. Research into the underlying record revealed that the codes had digits 
transposed and the corrections were easily made. 
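
A check of this kind can be sketched as below. The thesis corrected the two codes by consulting the underlying records; the code here only suggests candidates, and it assumes that "transposed" means two adjacent digits were swapped. The set of valid codes shown is a small illustrative subset taken from the examples in this chapter, not the full list of 135 codes in Appendix B.

```python
def transposition_candidates(code, valid_codes):
    """Return valid codes obtainable from `code` by swapping two adjacent digits."""
    digits = str(code)
    candidates = set()
    for i in range(len(digits) - 1):
        swapped = digits[:i] + digits[i + 1] + digits[i] + digits[i + 2:]
        if swapped != digits and int(swapped) in valid_codes:
            candidates.add(int(swapped))
    return sorted(candidates)


if __name__ == "__main__":
    # Illustrative subset of derogatory information codes mentioned in the text.
    valid = {1071, 1106, 1805, 1824, 1829, 1844}
    for entry in (1892, 1017, 1829):
        if entry in valid:
            print(entry, "is a valid code")
        else:
            print(entry, "is not valid; candidates:", transposition_candidates(entry, valid))
```

Any suggested candidate would still need to be confirmed against the original record, as was done in the study.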

Missing values, or blanks, were common in some columns. Care had to be taken 
to preserve these blanks when transferring from one system to another. The Stat- 
graphics representation of blanks as the integer -32768 proved useful in this regard. 
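
The role of the sentinel value can be illustrated with a short round-trip sketch. It assumes only what the text states, namely that blanks were represented as the integer -32768; the function names and the use of `None` for a blank on the receiving system are hypothetical.

```python
BLANK = -32768  # integer used by Statgraphics to represent a blank, per the text


def encode_blanks(column):
    """Replace missing values (None) with the sentinel before transfer."""
    return [BLANK if value is None else value for value in column]


def decode_blanks(column):
    """Restore missing values (None) from the sentinel after transfer."""
    return [None if value == BLANK else value for value in column]


if __name__ == "__main__":
    column = [1071, None, 9999, None]
    transferred = encode_blanks(column)
    print(transferred)
    print(decode_blanks(transferred))
```

Because -32768 is not one of the four-digit codes, a blank cannot be mistaken for data after the round trip.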






The files were initially represented in a random order by record number. This 
proved inconvenient when cross-validation of the record to its original file was necessary. 
The use of APL in conjunction with Statgraphics allowed all records to be reorganized 
in ascending order and made the file much easier to reference. 

Date fields were entered as six-digit codes representing month-day-year. Problems 
were encountered with formatting as Statgraphics requires a slash (/) between the month 
and day and the day and year. A simple APL function was written which performed this 
conversion. 
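
The date conversion can be sketched as follows. The original function was written in APL and is not reproduced in the text; this Python version is only an illustration. It assumes the six digits are in month-day-year order, as stated above, and it zero-fills integer input in case a leading zero was lost; whether Statgraphics required zero-padded fields is not stated, so the exact output format is an assumption.

```python
def add_date_slashes(raw):
    """Convert a six-digit month-day-year code such as 031888 to '03/18/88'."""
    digits = str(raw).strip().zfill(6)
    if len(digits) != 6 or not digits.isdigit():
        raise ValueError(f"expected a six-digit date code, got {raw!r}")
    return f"{digits[0:2]}/{digits[2:4]}/{digits[4:6]}"


if __name__ == "__main__":
    for code in ("022688", 31888, "030488"):
        print(code, "->", add_date_slashes(code))
```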

D. RECOMMENDATIONS FOR CODING A LARGE DATA BASE 

1. Care must be taken to differentiate even subtle variations in meaning by using dif- 
ferent codes. 

2. The data base must be designed with the analytical tools (software and 
hardware) in mind, consistent with the purpose and goals of the analysis. 

3. Proper documentation is essential when creating a data base. This is important not 
only as a record for the data base creators themselves, but also so that 
others may use the data base. It is also important because others may use the data 
long after the creator has finished with it and is no longer available to answer questions. 

4. Design of the data base should be a slow, careful affair. If this stage is neglected, 
the data base designer risks wasting many hours of work and compromising the real 
value of the data base. 

E. RECOMMENDATIONS FOR IMPLEMENTING A LARGE DATA BASE 

Statgraphics has a scrollable data editor which allows the entry, manipulation, and 
review of large data bases. It is convenient, simple to use, and, most importantly, makes 
it easy to correct and manipulate the data when anomalies are detected. 

In view of the value that such a scrollable data editor provided when reducing and 
documenting a data base which is already in existence, here are some recommendations 
for data base design. The design should: 

1. Allow for speedy input of and access to new data; 

2. Allow the data to be manipulated and massaged with scrollable full-screen data 
editors; 

3. Allow easy access by statistical graphics packages such as Grafstat. 

III. DATA ANALYSIS 



A. GENERAL APPROACH FOR THE EXPLORATORY DATA ANALYSIS 

The primary question which this thesis attempts to answer is, "What relationships 
exist between the information derived from the subjects' background investigations and 
the final disposition of their service?" The answers obtained here will not, of course, be 
all-inclusive, but they provide a starting point for further research involving this data base. 
Nor is this the only question that can be answered from the data base; as in 
much research, other questions and facts become apparent as the research progresses. 

Any data analysis begins with an investigation into the properties and 
limitations of the data. The PERSEREC special background investigation data (SBID) is 
primarily categorical in nature. The record for each individual contains several different 
types of information: 

1. Background and biographical information such as age, marital status, reason for 
investigation, etc.; 

2. Derogatory information (or lack thereof) obtained by investigators from various 
sources (high school, neighborhood, employers, etc.); this information may consist 
of crimes, subject admissions, and other matters that reflect on the person's char- 
acter and judgement; 

3. Recommendations from various people associated with these sources as to whether 
they felt that the individual in question should be trusted with a position of trust 
and responsibility; 

4. The result of the term of military service, whether the individual was discharged 
normally, or due to some adverse circumstances. 

The data can be viewed as information obtained prior to the completion of the 
investigation (explanatory variables or independent variables) and information which is the 
result of the person's service after the investigation (response variables or dependent 
variables). 

Note that the data is basically categorical, e.g., male or female, and thus has no in- 
herent ordering. Thus, while frequency counts can be obtained and are given in Ap- 
pendix A, no distributional measures, e.g., means or variances, can be computed. 
Similarly, dependencies and associations cannot be measured by moments based upon 
joint distributions, e.g., correlation coefficients. 
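
Since only counts are meaningful for data of this kind, a two-way table of counts is the natural summary of how two categorical variables relate. The sketch below is illustrative only: the two columns and their values are invented for the example, and the function name is hypothetical. It is not the Grafstat or APL code used in the thesis.

```python
from collections import Counter


def cross_tabulate(rows, cols):
    """Count co-occurrences of two equally long categorical columns.

    Returns the sorted row levels, the sorted column levels, and a table
    of counts with one list per row level.
    """
    if len(rows) != len(cols):
        raise ValueError("columns must have the same length")
    counts = Counter(zip(rows, cols))
    row_levels = sorted(set(rows))
    col_levels = sorted(set(cols))
    table = [[counts[(r, c)] for c in col_levels] for r in row_levels]
    return row_levels, col_levels, table


if __name__ == "__main__":
    # Invented example data: an explanatory variable and a response variable.
    sex = ["M", "M", "F", "M", "F"]
    outcome = ["good", "misconduct", "good", "good", "good"]
    row_levels, col_levels, table = cross_tabulate(sex, outcome)
    print(col_levels)
    for level, counts in zip(row_levels, table):
        print(level, counts)
```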

The data thus appears to be ideal for contingency table methods [Ref. 3: pp. 153-170]. 
However, note that the one response variable in the contingency table is almost 